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G\ Abstract 

Open-text (or open-domain) semantic parsers are designed to interpret any 
i | statement in natural language by inferring a corresponding meaning representa- 

tion (MR). Unfortunately, large scale systems cannot be easily machine-learned 
^L^ due to lack of directly supervised data. We propose here a method that learns to 

^2 assign MRs to a wide range of text (using a dictionary of more than 70,000 words, 

O which are mapped to more than 40,000 entities) thanks to a training scheme that 

combines learning from WordNet and ConceptNet with learning from raw text. 
I The model learns structured embeddings of words, entities and MRs via a multi- 

►> task training process operating on these diverse sources of data that integrates all 

/-/-\ the learnt knowledge into a single system. This work ends up combining methods 

\Q for knowledge acquisition, semantic parsing, and word-sense disambiguation. Ex- 

\Q periments on various tasks indicate that our approach is indeed successful and can 

C*") form a basis for future more sophisticated systems. 

o 



1 Introduction 

A key ambition of AI has always been to render computers able to read text and express 
its meaning in a formal representation in order to bring about a major improvement in 
human-computer interfacing, question answering or knowledge acquisition. Semantic 
parsing [25 1 precisely aims at building such systems to interpret statements expressed 
in natural language. The purpose of the semantic parser is to analyze the structure of 
sentence meaning and, formally, this consists of mapping a natural language sentence 
into a logical meaning representation (MR). This task seems too daunting to carry 
out manually (because of the vast quantity of knowledge engineering that would be 
required) so machine learning seems an appealing avenue. On the other hand, machine 
learning models usually require many labeled examples, which can also be costly to 
gather, especially when labeling properly requires the expertise of a linguist. 

Hence, research in semantic parsing can be roughly divided in two tracks. The first 
one, which could be termed in-domain, aims at learning to build highly evolved and 
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comprehensive MRs [ 16, 38 21 1. Since this requires advanced training data, such ap- 
proaches have to be applied to text from a specific domain with restricted vocabulary 
(a few hundred words). Alternatively, a second line of research, which could be termed 
open-domain, works towards learning to associate a MR to any kind of natural lan- 
guage sentence OTl[T7ll28l . In this case, the supervision is much weaker because it is 
unrealistic and infeasible to label data for large-scale, open-domain semantic parsing. 
As a result, models usually infer simpler MRs; this is sometimes referred to as shallow 
semantic parsing. 

In this paper, we propose a novel method directed towards the open-domain cate- 
gory with the aim of automatically inducing meaning representations out of free text, 
by exploiting existing resources such as WordNet to bootstrap and anchor the process. 
For a given sentence, the proposed approach infers a MR in two stages: (1) a semantic 
role labeling (SRL) step predicts the semantic structure, and (2) a disambiguation step 
assigns a corresponding entity to each relevant word, so as to minimize an energy given 
to the whole input. 

This paper considers simple MR structures and relies on an existing method to per- 
form SRL because its focus is on step (2). Indeed, in order to go open-domain, a large 
number of entities must be considered. For this reason, the set of entities considered 
is defined from WordNet [24]. This results in a dictionary of more than 70,000 words 
that can be mapped to more than 40,000 possible entities. For each word, WordNet 
provides a list of candidate senses so step (2) reduces to detecting the correct one and 
can be seen as a challenging all-words word-sense disambiguation (WSD) step. 

The model used here builds upon the structured embedding framework, defined in 
0, in which each entity of a knowledge base (such as WordNet) is encoded into a 
low dimensional embedding vector space preserving the original data structure. The 
training procedure is based on multi-task learning across different knowledge sources 
including WordNet, ConceptNet ll22l and raw text. In this way MRs induced from raw 
text and MRs for WordNet entities are embedded (and hence integrated) in the same 
space. This allows us to learn to perform disambiguation on raw text with little direct 
and much indirect supervision. The model can learn to use WordNet and ConceptNet 
knowledge (such as relations between entities) to help choose the correct sense of a 
particular word, and then label the words from the raw text with the WordNet sense. 
At the same time, MR prediction can also be seen as knowledge extraction. In addition 
to extracting MRs from raw text, the model proposed here has the potential to enrich 
WordNet with the extracted MRs. The proposed method is evaluated on different cri- 
teria to reflect its different properties. Thus, presented results illustrate MR inference, 
word-sense disambiguation, WordNet encoding and enrichment. 

The paper is organized as follows. Section [2] describes our framework to perform 
semantic parsing. Sectionplintroduces our model based on structured embeddings and 
Section|4]the multi-task training process we used to learn it. Section[5]discusses some 
related work. Finally our experiments are presented in Section|6] 



2 Semantic Parsing Framework 

2.1 Definitions 

The MRs considered in semantic parsing are simple logical expressions of the form 
REL(Aq, . . . , A n ). REL is the relation symbol, and Aq, ..., A n are its arguments. 
Note that several forms can be recursively constructed to form more complex struc- 
tures. Because this work is oriented towards raw text, a wide range of possible relation 
types and arguments must be considered. 

Hence, WordNet |24| is used for defining the arguments and some relation types as 
proposed in iFJTI . WordNet encompasses comprehensive knowledge within its graph 
structure, whose nodes (termed synsets) correspond to senses, and edges (which can 
have different types) define relations between those senses. Each synset is associated 
with a set of words sharing that sense. They are usually identified by 8 -digit codes, 
however, for clarity reasons, we indicate a synset by the concatenation of one of its 
words, its part-of-speech tag and a number indicating which sense it refers to (in the 
case of polysemous words). For example, score JNNA refers to the synset representing 
the first sense of the word "score" and also contains the words "mark" and "grade", 
whereas _score_NNJ2 refers to the second meaning (i.e. a written form of a musical 
composition). 

We denote instances of relations from WordNet using triplets (Ihs, rel, rhs), where 
Ihs depicts the left-hand side of the relation, rel its type and rhs its right-hand side. Ex- 
amples are {score -NN-1 , Jiypernym, .evaluation -NN-1) or (_score_AW_2 , Jiasjyart, 
jnusicaljiotationJJNJ). In this work we filter out the synsets appearing in less that 
15 triplets, as well as relation types appearing in less than 5000 triplets. We obtain a 
graph with the following statistics: 41,024 synsets and 18 relations types; a total of 
70,1 16 different words belong to these synsets. 

2.2 Methodology 

MR structure inference (and preprocessing) The first stage consists in preprocess- 
ing the text and inferring the structure of the MR. Using the SENNA softwareQl 12 1, we 
performed part-of-speech (POS) tagging, chunking, lemmatizatiorQand semantic role 
labeling (SRL). The SRL step consists in labeling, for each proposition, each semantic 
argument associated with a verb with its grammatical role. Each argument is specified 
by a tuple of lemmas. It is crucial because it will be used to infer the structure of the 
MR. In this restricted setting, the structure of the MR follows that of the sentence. 

We only consider sentences that match the following template: (subject, verb, direct 
object). Here, each of the three elements of the template is associated with a tuple of 
lemmatized words or synsets (when the words are disambiguated). SRL is used to 
structure the sentence into the (Ihs = subject, rel = verb, rhs = object) template, note 
that the order is not necessarily subject / verb / direct object in the raw text. The seman- 
tic match energy function is used to predict appropriate synsets or answer questions by 
choosing those corresponding to low-energy synset configurations. 



'Freely available from ml . nec-labs . com/senna/ 

2 lemmatization is not carried out with SENNA but with the NLTK toolkit, nltk . org 



Clearly, the subject-verb-object structure causes the resulting MRs to have a straight- 
forward structure (with a single relation), but this pattern is the most common and a 
good choice to test our ideas at scale. Learning to infer more elaborate grammatical 
patterns is left as future work. In this work we chose to focus on handling the large 
scale of the set of entities. 

As an illustration, to parse the sentence: "A musical score accompanies a televi- 
sion program or a film.", the SRL step will produce as output the following triplet 
{jnusicalJJ scoreJNN, accompany JVB, Jelevisionj?rogramJ\fN JilmJNN). In the 
following, we call the concatenation of a lemmatized word and POS tag (such as NN, 
VB, etc.) a lemma. Note the absence of an integer suffix, which distinguishes a lemma 
from a synset: a lemma is allowed to be semantically ambiguous. To summarize, this 
step starts from a sentence and either rejects it or outputs a triplet of lemma tuples, one 
for the subject, one for the relation or verb, and one for the direct object. 

Detection of MR entities This second step aims at identifying each semantic entity 
expressed in a sentence. Given a relation triplet (lhs lem , rel lem , rhs lem ) where each 
element of the triplet is associated with a tuple of lemmas, a corresponding triplet 
{lhs syn , rel syn , rhs syn ) is produced, where the lemmas are replaced by synsets. 
Depending on the lemmas, this can be either straightforward (some lemmas such as 
Jelevision_program_NN or _world.warJi_NN correspond to a single synset) or very 
challenging (_run_VB can be mapped to 41 different synsets and _run_NN to 16). Hence, 
in the proposed semantic parsing framework, MRs correspond to triplets of synsets 
{lhs syn , rel syn , rhs syn ). For the example from the previous section, the associated 
MR is {{jnusicalJJ J, jcore_AW_2), -accompany J/B J , {-television jprogramJ^N J , 
jUmJVNJJ). 

This step can be seen as particular form of all-words word-sense disambiguation. 
This is achieved by an approximate search for a set of synsets that are compatible with 
the observed lemmas and that also minimize a semantic matching energy function, 
using the model described in the next section. 

Since the model is structured around relation triplets, MRs and WordNet relations 
are cast into the same scheme. For example, the WordNet relation {_score_NNJ2 , 
Jias_part, jnusicaljiotationjNN J) fits the same pattern as our MRs, with the relation 
type Jiasjiart playing the role of the verb. 



3 Structured Embeddings 

Inspired by the framework introduced by Bordes et al. [6| as well as by recent work of 
L. Bottou [8 1, the main idea behind our structural embedding model is the following. 

• Named symbolic entities (including WordNet synsets and relation types and lem- 
mas) are associated with a d-dimensional vector space, termed the "embedding 
space", following previous work in neural language models (see Q for a review). 
The i th entity is assigned a vector E t € R d . Note that if a lemma is unambigu- 
ous because it maps to a single synset, its embedding and the embedding of this 
synset are shared. 



• The semantic energy function value associated with a particular triplet (Ihs, rel, 
rhs) is computed by a parametrized function £ that starts by mapping all of the 
symbols to their embeddings. Note that in our case £ must be able to handle 
variable-size arguments, since for example there could be multiple lemmas in 
the subject part of the sentence. 

• The energy function £ is optimized to be lower for training examples than for 
other possible configurations of symbols. Hence the semantic energy func- 
tion can distinguish plausible combinations of entities from implausible ones, 
to choose the most likely sense for a lemma, or to answer questions, e.g. corre- 
sponding to a tuple (Ihs, rel, ?) with a missing rhs entry "?". 

3.1 Training Objective 

Let us now more formally define the training criterion for the semantic match energy 
function. Let C denote the dictionary which includes all entities (lemmas and synsets) 
and relation types of interest, and let C* denote the set of tuples (or sequences) whose 
elements are taken in C. Let TZ C C be the subset of entities which are relation types 
(R,* is defined similarly as C*). We are given a training set T> containing m triplets of 
the form x = {xi hs ,x re i,x rhs ), where x lhs e C* , x rel £ 11*, and x rhs e C* . We 
define the energy as £ (x) = £(xih s ,x re i,x r hs)- Ideally, we would like to perform 
maximum likelihood over P(x) oc e~ £ ^' but this is intractable. The approach we 
follow here has already been used successfully in ranking settings lfTTll35l 341 and cor- 
responds to performing two approximations. First, like in pseudo-likelihood we only 
consider one input given the others. Second, instead of sampling a negative exam- 
ple from the model posterior, we use a ranking criterion (that is based on uniformly 
sampling a negative example). 

If one of the elements of a given triplet were missing, then we would like the 
model to be able to predict the correct entity. For example, this would allow us to 
answer questions like "what is part of a car?" or "what does a score accompany?". 
The objective is to learn a real-valued semantic energy function £ such that it can 
successfully rank the training samples below all other possible triplets: 

£{x) < £{i, x reh x rhs ) Vi € C* : (i, x re i,x rhs ) <£ V (1) 

£(x) < £(x ihs ,k, Xrhs) Vfc ell* : {x ihs ,k, x rhs ) <£ V (2) 

£(x) < £(x lhs ,x re i,j) \fj e C* : (xihs,Xrel,j) i- V (3) 

In practice the following stochastic criterion is minimized: 

J2 E nu«(£(aO -£(£) + 1,0) (4) 

where Q(x\x) is a corruption process that transforms a training example x into a cor- 
rupted negative example. In the experiments Q only changes one of the three members 
of the triplet, by changing only one of the lemmas, synsets or relation type in it, by 
sampling it uniformly from C (not actually checking if the negative example is in T>). 
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Figure 1: Semantic matching energy function. A triple of tuples (Ihs, rel, rhs) is 
first mapped to its embeddings Eihs, Eihs and Eihs (using an aggregating function for tuples 
involving more than one symbol). Then Eihs and E re i are combined using gieft{-) to output 
Ei h3 ( re i) (similarly E rha ( re i) - g r i a ht{E rhs ,E re i)). Finally the energy £((lhs, rel, rhs)) is 
obtained by merging Ei hs ^ re i) and E rh s(rei) with the h(.) function. 



3.2 Parametrization of the Semantic Matching Energy Function 

Many parametrizations are possible for the semantic matching energy function but we 
have explored only a few. Let the input triplet be x — ((lhs\,lhs2, ■ ■ ■), (rel\,reli, . . .), 
(rhsi,rhs2, ■ ■ ■))■ In all of the experiments, the energy function is structured as fol- 
lows, based on the intuition that the relation type should first be used to extract relevant 
components from each argument's embedding, and put them in a space where they can 
then be compared (see FigurefTlfor an illustration). 

(1) Each symbol i in the input tuples is mapped to its embedding Ei £ M. d . 

(2) The embeddings associated with all the symbols within the same tuple are ag- 
gregated by a pooling function tt (we only used the mean in the experiments but 
other plausible candidates include the sum, the max, and combinations of several 
such elementwise statistics): 
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where Ihsi denotes the i-th individual element of the left-hand side tuple, etc. 

(3) The embeddings E^s and E re i respectively associated with the Ihs and rel ar- 
guments are used to construct a new relation-dependent embedding E lhs ( re ij for 
the Ihs in the context of the relation type represented by E re i, and similarly 



for the r hs: E lhs ^ rel ^ — gi e ft(Eihs,E re i) and E r h s ^ re ^ — g r i g ht(E r hs,E re i), 
where gi e ft and g r i g ht are parametrized functions whose parameters are tuned 
during training. See more details in the experiments section. 

(4) The energy is computed from the transformed embeddings of the left-hand side 
and right-hand side: £{x) = h(Ei hs ( rel j, E rhs ^ re ^), where h is a parametrized 
function whose parameters are tuned during training. More details are given in 
the experiments section. 

3.3 Disambiguation Process 

Our semantic matching energy function is used for raw text semantic to perform 



stage 2 of the protocol described in Section 2.2 that is to carry out the word-sense 
disambiguation step. 

We label a triplet of lemmas ((lhs[ em , lhs l 2 em , . . .), (rel[ em , . . .), (rhs[ em , . . .)) 
with synsets in a greedy fashion, one lemma at a time. For labeling Ihs^" 1 for in- 
stance, we fix all the remaining elements of the triplet to their lemma and select the 
synset leading to the lowest energy: 

IhsT = a rgmin SeC{synllem) £((lh S l r, S, . . .), (r<"\ . . .), (rhs 1 ^, • ■ •)) (5) 

with C(syn\len) the set of allowed synsets to which Ihs 1 ^" 1 can be mapped. We repeat 
that for all lemmas. We always use lemmas as context, and never the already assigned 
synsets. This process is interesting because it is efficient as it only requires to compute 
a low number of energies, equal to the number of senses for a lemma. However, it 
requires to have good representations (i.e. good embedding vectors) for synsets and 
lemmas. That is the reason why the multi-task training presented in next section, takes 
good care of learning both properly. 



4 Multi-Task Training 
4.1 Multiple Data Resources 

In order to encode as much common-sense knowledge as possible in the model, the 
following heterogeneous data sources are combined. 



WordNet v3.0 (WN). Described in Section 2. 1 this is the main resource, defining the 
dictionary of entities. The 18 relation types and 40,989 synsets retained are composed 
to form a total of 221,017 triplets. We randomly extracted from them a validation and 
a test set with 5,000 triplets each. 

WordNet contains only relations between synsets. However, the disambiguation 
process needs embeddings for synsets and for lemmas. Following [19|, we created 
two other versions of this dataset to leverage WN in order to also learn lemma em- 
beddings: "Ambiguated" WN and "Bridge" WN. In "Ambiguated" WN both synset 
entities of each triplet are replaced by one of their corresponding lemmas. "Bridge" 



WN is designed to teach the model about the connection between synset and lemma 
embeddings, thus in its relations the Ihs or rhs synset is replaced by a corresponding 
lemma. Sampling training examples from WN involves actually sampling from one of 
its three versions, resulting in a triplet involving synsets, lemmas or both. 



ConceptNet v2.1 (CN). CN [22| is a common-sense knowledge base in which lem- 
mas or groups of lemmas are linked together with rich semantic relations as, for exam- 
ple, (JcitchenJable _AW, jusedjor, -eatJVB Jbreakfast-NN). It is based on lemmas and 
not synsets, and it does not make distinctions between different senses of a word. Only 
triplets containing lemmas from the WN dictionary are kept, to finally obtain a total of 
1 1,332 training lemma triplets. 

Wikipedia (Wk). This resource is simply raw text meant to provide knowledge to 
the model in an unsupervised fashion. In this work 50,000 Wikipedia articles were 
considered, although many more could be used. Using the protocol of the first para- 



graph of Section 2.2 we created a total of 1,484,966 triplets of lemmas. Imperfect 
training triplets (containing a mix of lemmas and synsets) are produced by performing 
the disambiguation step of Section [33] on one of the lemmas. This is equivalent to 
MAP (Maximum A Posteriori) training, i.e., we replace an unobserved latent variable 
by its mode according to a posterior distribution (i.e. to the minimum of the energy 
function, given the observed variables). We have used the 50,000 articles to generate 
more than 3M examples. 

Extended WordNet (XWN) and Unambiguous Wikipedia (Wku). XWNGD is 
built from WordNet glosses, syntactically parsed and with content words semantically 



linked to WN synsets. Using the protocol of Section 2.2 we processed these sentences 
and collected 47,957 lemma triplets for which the synset MRs were known. We re- 
moved 5,000 of these examples to use them as an evaluation set for the MR entity 
detection/word-sense disambiguation task. With the remaining 42,957 examples, we 
created unambiguous training triplets to help the performance of the disambiguation 



algorithm described in Section 3.3 for each lemma in each triplet, a new triplet is 
created by replacing the lemma by its true corresponding synset and by keeping the 
other members of the triplet in lemma form (to serve as examples of lemma-based con- 
text). This led to a total of 786,105 training triplets, from which we removed 10,000 
examples to build a validation set. 

We added to this training set some triplets extracted from the Wikipedia corpus 
which were modified with the following trick: if one of its lemmas corresponds unam- 
biguously to a synset, and if this synset maps to other ambiguous lemmas, we create 
a new triplet by replacing the unambiguous lemma by an ambiguous one. Hence, we 
know the true synset in that ambiguous context. This allowed to create 981,841 addi- 
tional triplets with supervision, and we named this data set Unambiguous Wikipedia. 



4.2 Training Procedure 

To train the parameters of the energy function £ we loop over all of the training data 
resources and use stochastic gradient descent (SGD) BUI . That is, we iterate the fol- 
lowing steps: 

1. Select a positive training triplet Xi at random (composed of synsets, of lemmas 
or both) from one of the above sources of examples. 

2. Select at random resp. constraint ([TJ, |2]i or (J3). 

3. Create a negative triplet x by sampling an entity from C to replace resp. Ihsi, 
reli or rhsi. 

4. Iff (x^ > £ (x) — 1, make a stochastic gradient step to minimize the criterion Q. 

5. Enforce the constraint that each embedding vector is normalized, ||-Ej|| = 1, VI 

The constant 1 in step [4] is the margin as is commonly used in many margin- 
based models such as SVMs [7|. The gradient step requires a learning rate of A. The 
normalization in stepBlhelps remove scaling freedoms from the model. 

The above algorithm was used for all the data sources except XWN and Wku. In 
that case, positive triplets are composed of lemmas (as context) and of a disambiguated 
lemma replaced by its synset. Unlike for Wikipedia, this is labeled data, so we are 
certain that this synset is the true sense. Hence, to increase training efficiency and 
yield a more discriminant disambiguation, in stepplwith probability | we either sample 
randomly from C or we sample randomly from the set of remaining candidate synsets 
corresponding to this disambiguated lemma (i.e. the set of its other meanings). 

The matrix E which contains the representations of the entities is thus learnt via a 
complex multi-task learning procedure because a single embedding matrix is used for 
all relations and all data sources (each really corresponding to a different distribution 
of symbol tuples, i.e., a different task). As a result, the embedding of an entity contains 
factorized information coming from all the relations in which the entity is involved as 
Ihs, rhs or even rel (for verbs). For each entity, the model is forced to learn how it 
interacts with other entities in many different ways. 

5 Related Work 

Our approach is original by the way that it connects many tasks and many training re- 
sources within the same framework. However, it is highly related with many previous 
works. Shi and Mihalcea [ 3 1 1 proposed a rule-based system for open-text semantic 
parsing using WordNet and FrameNet [2| while Giuglea and Moschitti [17| proposed 
a model to connect WordNet, VerbNet and PropBank |20| for semantic parsing us- 
ing tree kernels. Poon and Domingos [28 29] recently introduced a method based on 
Markov-Logic Networks for unsupervised semantic parsing that can be also used for 
information acquisition. However, instead of connecting MRs to an existing ontology 
the proposed method does, it constructs a new one and does not leverage pre-existing 



knowledge. Automatic information extraction is the topic of many models and de- 
mos Il32ll37ll36l[33ll but none of them relies on a joint embedding model. In that trend, 
some approaches have been directly targeting to enrich existing resources, as we do 
here with WordNet, (DEIEO) but these never use learning. Finally, several previous 
works have targeted to improve WSD by using extra-knowledge by either automatically 
acquiring examples [23 1 or by connecting different knowledge bases |[T9l . 

Our model is related to earlier approaches (e.g. [4, 11, 26. 9]) and is similar to but 
more convenient than the approach of Bordes et al. [6], where the embeddings for the 
left/right-hand side arguments i and j are d-vectors and the embedding for relation k 
is a pair of d X d matrices. The disadvantage of embedding each relation type into a 
pair of matrices is that it gives relation types a different status (they cannot appear as 
left-hand side or right-hand side) and many more parameters. 

6 Experiments 

6.1 Experimental Setting 

Experiments were performed with three different types of parametrizations (linear, bi- 
linear and non-linear) for the g functions. We selected the hyper-parameter values 
w.r.t. the WN validation set. We only present in this section results with the bilinear 
parametrization for g and with a dot product for the output h function, h(a, b) = a ■ b, 
because this combination achieved the best performance in validation. The bilinear 
function for the left side (and similarly for the right side) is as follows: 

g left (E lh3 , E rel ) 1 = J2( W ^n * (Wih f t*EJ he + b k lleft ) * (W£ ft *Ei el + b% left ) + b l 3left ) 

i,j,k 

where i denotes indices of the elements of the embedding Eihs, j of the relation em- 
bedding E re i, k of the latent representation, and I of the output of the g function 
(that will be fed into the dot product). We hypothesize that the success of the bilinear 
parametrization comes from its natural ability to encode AND relationships between 
the Ihs (or rhs) and the rel embeddings. 

To assess the performance w.r.t. choices made with that architecture, the multi- 
task training and the diverse data sources, we evaluated models trained with several 
combinations of data sources. WN denotes models trained on WordNet, "Ambiguated" 
WordNet and "Bridge" WordNet, WN+CN+Wk models also trained on CN and Wk 
datasets, and All models are trained on all sources. 

6.2 WordNet Encoding 

The WN encoding is measured with the mean predicted rank and the prediction at top 
10 (top- 10), calculated with the following procedure. For each test WordNet triplet, 
the left entity is removed and replaced by each of the 41,024 synsets of the dictionary 
in turn. Energies of those degraded triplets are computed by the model and sorted by 
ascending order and the rank of the correct synset is stored. That is done for both the 
left-hand and right-hand arguments of the relation. The mean predicted rank is the 
average of those predicted ranks and top- 10 is the proportion of ranks within 1 and 10. 
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Table 1 : WordNet encoding and Word Sense Disambiguation results. MFS is just 
using the Most Frequent Sense. All+MFS is our best system, combining all sources 
of information. Random chooses uniformly among allowed synsets. (*)Results of 
StructEmbed, copied from |6|, were obtained with a different version of WordNet and are 
presented here as indication only. 



Model 


WordNet rank 


WordNet top-10 


F1XWN 


Fl Senseval3 


All+MFS 


- 


- 


72.35% 


70.19% 


All 


139.3 


34.71% 


67.52% 


51.44% 


WN+CN+Wk 


95.9 


46.02% 


34.80% 


34.13% 


WN 


72.1 


58.87% 


29.55% 


28.36% 


MFS 


- 


- 


67.17% 


67.79% 


Gamble (B) 


- 


- 


- 


66.41% 


StructEmbed |6) 


140.1* 


74.20%* 


- 


- 


Random 


20512 


0.024% 


26.71% 


29.55% 



The left side of Table [T] presents the comparative results, together with previous 
performance of [6 1 (StructEmbed). We obtain better predicted rank but lower top-10. 
However, the difference can be caused by the fact that the dictionaries of synsets are 
slightly different between [6] and the setting presented here (different preprocessing of 
WordNet). Training with other data sources (All) still allows to encode well WordNet 
knowledge, even if it is slightly worse than with WordNet alone (WN). 



6.3 Word Sense Disambiguation 

Performance on WSD is assessed on two test sets: the XWN test set and a subset of 
English All-words WSD task of SensEval-3r]For the latter, we processed the original 
data using the protocol of Section 2.2 and obtained a total of 208 words to disambiguate 
(out of m 2000 originally). The performance of the most frequent sense (MFS) based 
on WordNet frequencies is also evaluated. Finally, we also report the results of Gam- 
ble lfl5ll . winner of Senseval-3, on our subset of its data. 

Fl scores are presented in TablefTI(right). The difference between All and WN+CN 
+Wk indicates that, even without direct supervision, the model can disambiguate some 
words (WN+CN+Wk is significantly above Random), that the information from XWN 
and Wku is crucial (+30%) and yields performance better than MFS (a strong baseline 
in WSD) on the XWN test set. 

However, performance can be greatly improved by combining the All sources model 
and the MFS score. To do so, we converted the frequency information into an energy 
by taking minus the log frequency and used it as an extra energy term. The total en- 
ergy function is used for disambiguation. This yields the results denoted by All+MFS 
which achieves the best performance of all the methods tried. 



i . senseval . org/senseval3 
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Figure 2: MR ranking with respect to the model energy on the XWN test set. 

On a total of 41,024 synset entities, the median/mean rank is 640/3012 whereas with Word- 
Net:: Similarity we obtained 13,200/13,800. 



6.4 Meaning Representations 

Assessing the quality of the obtained MRs is difficult as there is no benchmark for 
open-text semantic parsing. Yet, we intend to give an insight of how well the model 
represents language information by measuring and ranking the energy function of MRs. 
We measure the ranks of the predicted and the correct MRs obtained from the XWN 

we replace each lemma 



test set, like WordNet triplets were treated in Section 6.2 



of a triplet by all the 41,024 synsets to rank (by energy) both correct and predicted 
synsets (according to AU+MFS). In both cases we obtain comparable median (mean) 
ranks: 640 (3012)/41,024 for the correct representation and 516 (2443)/41,024 for the 
prediction. Figure [2] shows a histogram of the ranks of each correct MR of the test set 
for All+MFS. These low ranks indicates that the model has learnt to give lower energies 
to plausible MRs, i.e. has integrated higher level information about how synsets and 
lemmas can form sentences in language. 

Interestingly, the WordNet: Similarity package [27| can also be used to rank MRs 
because it is designed to compute a similarity between synsets using the WordNet 
graph. We used the package's second order co-occurrence vector of synset definitions 
to compute similarity, which gave the best results. This leads to 13,500 (14,000)/41, 024 
median (mean) ranks for the correct representations. Even though, these ranks are 
above chance, they are far worse than those of our model. This is mainly caused by 
the fact that WordNet: Similarity only knows about relations between language entities 
through the WordNet graph, for which the number of relation type is very low (« 20). 
For instance, it has no clue that a noun and a verb can (and how they can) be related, 
contrary to our model that learns that through its multi-task training on raw text. 
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Model (All) 


TextRunner 


Ihs 


_army_NN_l 


army 


rel 


_attack_VB_l 


attacked 




_troop_NN_4 


Israel 


top 


_armed_serviceJSfN_l 


the village 


ranked 


_ship_NN_l 


another army 


rhs 


.territory JSIN_1 


the city 




_military_unit_NN_l 


the fort 




_business_firm_NN_l 


People 


top 


_person_NN_l 


Players 


ranked 


_family_NN_l 


one 


Ihs 


.payoff _NN_3 


Students 




_card_gameJSIN_l 


business 


rel 


_earn_VB_l 


earn 


rhs 


_money_NN_l 


money 



Table 2: Lists of entities reported by our system and by TextRunner. 



6.5 WordNet Enrichment 

WordNet and ConceptNet use a limited number of relation types. Thanks to its multi- 
task training and its unified representation for MRs and WordNet/ConceptNet rela- 
tions, our model is able to learn rich relationships between synsets and lemmas and can 
even enrich those knowledge bases since it sees every verb as a potential relation type. 
Therefore our model is able to learn richer relationships between synsets and lemmas 
entities. As illustration, predicted lists for relation types that do not exist in the two 
knowledge bases are given in Table [2] We also compare with lists returned by Tex- 
tRunner [37 1 (an information extraction tool having extracted information from 100M 
webpages, to be compared with our 50k Wikipedia articles). Lists from both systems 
truly reflect common-sense. However, contrary to our system, TextRunner does not 
disambiguate different senses of a lemma, and thus it cannot connect its knowledge to 
an existing resource to enrich it. 

7 Conclusion 

In this work we developed a large-scale system for semantic parsing from raw text 
to disambiguated meaning representations. The generalization ability of our method 
crucially centers upon scoring triplets of relations between ambiguous lemmas and un- 
ambiguous concepts (synsets) both using a single structured embedding energy func- 
tion. Multi-tasking the learning of such a function over several resources we effectively 
learn to build disambiguated meaning representations from raw text with little direct 
supervision. 

The final system can potentially capture the deep semantics of sentences in the 
structured embedding energy function by generalizing the knowledge learnt across the 
multiple resources (e.g. common-sense knowledge from ConceptNet and relations be- 
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tween concepts from WordNet) and linking it to raw text (from Wikipedia). We ob- 
tained positive experimental results on several semantic tasks that appear to support 
this assertion, but future work should explore the capabilities of such systems further 
including other semantic tasks, and utilizing more evolved grammars, e.g. by using 
FrameNet [2| (see e.g. [13|). 
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